{
  "cells": [
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "%matplotlib inline"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "\n# How to use ``sampling_strategy`` in imbalanced-learn\n\nThis example shows the different usage of the parameter ``sampling_strategy``\nfor the different family of samplers (i.e. over-sampling, under-sampling. or\ncleaning methods).\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Authors: Guillaume Lemaitre <g.lemaitre58@gmail.com>\n# License: MIT"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "print(__doc__)\nimport seaborn as sns\n\nsns.set_context(\"poster\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## Create an imbalanced dataset\n\nFirst, we will create an imbalanced data set from a the iris data set.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from sklearn.datasets import load_iris\nfrom imblearn.datasets import make_imbalance\n\niris = load_iris(as_frame=True)\n\nsampling_strategy = {0: 10, 1: 20, 2: 47}\nX, y = make_imbalance(iris.data, iris.target, sampling_strategy=sampling_strategy)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import matplotlib.pyplot as plt\n\nfig, axs = plt.subplots(ncols=2, figsize=(10, 5))\nautopct = \"%.2f\"\niris.target.value_counts().plot.pie(autopct=autopct, ax=axs[0])\naxs[0].set_title(\"Original\")\ny.value_counts().plot.pie(autopct=autopct, ax=axs[1])\naxs[1].set_title(\"Imbalanced\")\nfig.tight_layout()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "### Using ``sampling_strategy`` in resampling algorithms\n\n## `sampling_strategy` as a `float`\n\n`sampling_strategy` can be given a `float`. For **under-sampling\nmethods**, it corresponds to the ratio $\\\\alpha_{us}$ defined by\n$N_{rM} = \\\\alpha_{us} \\\\times N_{m}$ where $N_{rM}$ and\n$N_{m}$ are the number of samples in the majority class after\nresampling and the number of samples in the minority class, respectively.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "import numpy as np\n\n# select only 2 classes since the ratio make sense in this case\nbinary_mask = np.bitwise_or(y == 0, y == 2)\nbinary_y = y[binary_mask]\nbinary_X = X[binary_mask]"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.under_sampling import RandomUnderSampler\n\nsampling_strategy = 0.8\nrus = RandomUnderSampler(sampling_strategy=sampling_strategy)\nX_res, y_res = rus.fit_resample(binary_X, binary_y)\nax = y_res.value_counts().plot.pie(autopct=autopct)\n_ = ax.set_title(\"Under-sampling\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "For **over-sampling methods**, it correspond to the ratio\n$\\\\alpha_{os}$ defined by $N_{rm} = \\\\alpha_{os} \\\\times N_{M}$\nwhere $N_{rm}$ and $N_{M}$ are the number of samples in the\nminority class after resampling and the number of samples in the majority\nclass, respectively.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.over_sampling import RandomOverSampler\n\nros = RandomOverSampler(sampling_strategy=sampling_strategy)\nX_res, y_res = ros.fit_resample(binary_X, binary_y)\nax = y_res.value_counts().plot.pie(autopct=autopct)\n_ = ax.set_title(\"Over-sampling\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## `sampling_strategy` has a `str`\n\n`sampling_strategy` can be given as a string which specify the class\ntargeted by the resampling. With under- and over-sampling, the number of\nsamples will be equalized.\n\nNote that we are using multiple classes from now on.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "sampling_strategy = \"not minority\"\n\nfig, axs = plt.subplots(ncols=2, figsize=(10, 5))\nrus = RandomUnderSampler(sampling_strategy=sampling_strategy)\nX_res, y_res = rus.fit_resample(X, y)\ny_res.value_counts().plot.pie(autopct=autopct, ax=axs[0])\naxs[0].set_title(\"Under-sampling\")\n\nsampling_strategy = \"not majority\"\nros = RandomOverSampler(sampling_strategy=sampling_strategy)\nX_res, y_res = ros.fit_resample(X, y)\ny_res.value_counts().plot.pie(autopct=autopct, ax=axs[1])\n_ = axs[1].set_title(\"Over-sampling\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "With **cleaning method**, the number of samples in each class will not be\nequalized even if targeted.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "from imblearn.under_sampling import TomekLinks\n\nsampling_strategy = \"not minority\"\ntl = TomekLinks(sampling_strategy=sampling_strategy)\nX_res, y_res = tl.fit_resample(X, y)\nax = y_res.value_counts().plot.pie(autopct=autopct)\n_ = ax.set_title(\"Cleaning\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## `sampling_strategy as a `dict`\n\nWhen `sampling_strategy` is a `dict`, the keys correspond to the targeted\nclasses. The values correspond to the desired number of samples for each\ntargeted class. This is working for both **under- and over-sampling**\nalgorithms but not for the **cleaning algorithms**. Use a `list` instead.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "fig, axs = plt.subplots(ncols=2, figsize=(10, 5))\n\nsampling_strategy = {0: 10, 1: 15, 2: 20}\nrus = RandomUnderSampler(sampling_strategy=sampling_strategy)\nX_res, y_res = rus.fit_resample(X, y)\ny_res.value_counts().plot.pie(autopct=autopct, ax=axs[0])\naxs[0].set_title(\"Under-sampling\")\n\nsampling_strategy = {0: 25, 1: 35, 2: 47}\nros = RandomOverSampler(sampling_strategy=sampling_strategy)\nX_res, y_res = ros.fit_resample(X, y)\ny_res.value_counts().plot.pie(autopct=autopct, ax=axs[1])\n_ = axs[1].set_title(\"Under-sampling\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## `sampling_strategy` as a `list`\n\nWhen `sampling_strategy` is a `list`, the list contains the targeted\nclasses. It is used only for **cleaning methods** and raise an error\notherwise.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "sampling_strategy = [0, 1, 2]\ntl = TomekLinks(sampling_strategy=sampling_strategy)\nX_res, y_res = tl.fit_resample(X, y)\nax = y_res.value_counts().plot.pie(autopct=autopct)\n_ = ax.set_title(\"Cleaning\")"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## `sampling_strategy` as a callable\n\nWhen callable, function taking `y` and returns a `dict`. The keys\ncorrespond to the targeted classes. The values correspond to the desired\nnumber of samples for each class.\n\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def ratio_multiplier(y):\n    from collections import Counter\n\n    multiplier = {1: 0.7, 2: 0.95}\n    target_stats = Counter(y)\n    for key, value in target_stats.items():\n        if key in multiplier:\n            target_stats[key] = int(value * multiplier[key])\n    return target_stats\n\n\nX_res, y_res = RandomUnderSampler(sampling_strategy=ratio_multiplier).fit_resample(X, y)\nax = y_res.value_counts().plot.pie(autopct=autopct)\nax.set_title(\"Under-sampling\")\nplt.show()"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.10.4"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}